Biostat 212a Homework 1

Due Jan 23, 2024 @ 11:59PM

Author

Yuhui Wang, UID: 606332401

Published

January 23, 2024

1 Filling gaps in lecture notes (10pts)

Consider the regression model \[ Y = f(X) + \epsilon, \] where \(\operatorname{E}(\epsilon) = 0\).

1.1 Optimal regression function

Show that the choice \[ f_{\text{opt}}(X) = \operatorname{E}(Y | X) \] minimizes the mean squared prediction error \[ \operatorname{E}[Y - f(X)]^2, \] where the expectations averages over variations in both \(X\) and \(Y\). (Hint: condition on \(X\).)

Answer: By the tower property, \[ \operatorname{E}[Y - f(X)]^2 = \operatorname{E}\{\operatorname{E}[(Y - f(X))^2 \mid X]\}, \] so it suffices to minimize the inner conditional expectation pointwise in \(X\). Fix \(X = x\) and write \(c = f(x)\): \[ \operatorname{E}[(Y - c)^2 \mid X = x] = \operatorname{E}[Y^2 \mid X = x] - 2c\operatorname{E}[Y \mid X = x] + c^2, \] which is a quadratic in \(c\). Taking the derivative with respect to \(c\) and setting it to 0 gives \[ 2c - 2\operatorname{E}[Y \mid X = x] = 0, \] so the minimizer is \(c = \operatorname{E}[Y \mid X = x]\). Since this holds for every \(x\), \[ f_{\text{opt}}(X) = \operatorname{E}(Y \mid X). \]
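As a quick numerical sanity check (a simulation sketch with an assumed true function \(f(x) = x^2\) and standard normal noise, not part of the proof), the conditional mean should attain a smaller mean squared prediction error than alternative regression functions:

```r
# Simulation sketch: with Y = X^2 + eps, the conditional mean E(Y | X) = X^2
# should attain the smallest mean squared prediction error.
set.seed(212)
n <- 1e5
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n)              # true f(x) = x^2, Var(eps) = 1

mse <- function(fhat) mean((y - fhat(x))^2)
mse(function(x) x^2)             # approximately Var(eps) = 1
mse(function(x) x^2 + 0.5)       # shifted candidate: larger
mse(function(x) abs(x))          # misspecified candidate: larger still
```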

1.2 Bias-variance trade-off

Given an estimate \(\hat f\) of \(f\), show that the test error at a \(x_0\) can be decomposed as \[ \operatorname{E}[y_0 - \hat f(x_0)]^2 = \underbrace{\operatorname{Var}(\hat f(x_0)) + [\operatorname{Bias}(\hat f(x_0))]^2}_{\text{MSE of } \hat f(x_0) \text{ for estimating } f(x_0)} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}}, \] where the expectation averages over the variability in \(y_0\) and \(\hat f\).

Answer: \[ \begin{align} \operatorname{E}[y_0 - \hat f(x_0)]^2 &= \operatorname{E}[(y_0 - f(x_0) + f(x_0) - \hat f(x_0))^2] \\ &= \operatorname{E}[(y_0 - f(x_0))^2] + 2\operatorname{E}[(y_0 - f(x_0))(f(x_0) - \hat f(x_0))] + \operatorname{E}[(f(x_0) - \hat f(x_0))^2] \end{align} \] For this equation, the first term is: \[ \begin{align} \operatorname{E}[(y_0 - f(x_0))^2] &= \operatorname{E}[(f(x_0) + \epsilon - f(x_0))^2] \\ &= \operatorname{E}[\epsilon^2] \end{align} \] For the second term, since \(y_0 - f(x_0) = \epsilon\), \(\operatorname{E}(\epsilon) = 0\), and \(\epsilon\) (the noise in the new observation \(y_0\)) is independent of \(\hat f\) (which is built from the training data), we have: \[ \operatorname{E}[(y_0 - f(x_0))(f(x_0) - \hat f(x_0))] = \operatorname{E}(\epsilon)\operatorname{E}[f(x_0) - \hat f(x_0)] = 0 \] For the third term, we have: \[ \begin{align} \operatorname{E}[(f(x_0) - \hat f(x_0))^2] &= \operatorname{E}[(f(x_0) - \operatorname{E}(\hat f(x_0)) + \operatorname{E}(\hat f(x_0)) - \hat f(x_0))^2] \\ &= \operatorname{E}[(f(x_0) - \operatorname{E}(\hat f(x_0)))^2] + 2\operatorname{E}[(f(x_0) - \operatorname{E}(\hat f(x_0)))(\operatorname{E}(\hat f(x_0)) - \hat f(x_0))] + \operatorname{E}[(\operatorname{E}(\hat f(x_0)) - \hat f(x_0))^2] \\ &= \operatorname{E}[(f(x_0) - \operatorname{E}(\hat f(x_0)))^2] + \operatorname{E}[(\operatorname{E}(\hat f(x_0)) - \hat f(x_0))^2] \\ &= [\operatorname{Bias}(\hat f(x_0))]^2 + \operatorname{Var}(\hat f(x_0)) \end{align} \] where the cross term vanishes because \(f(x_0) - \operatorname{E}(\hat f(x_0))\) is a constant and \(\operatorname{E}[\operatorname{E}(\hat f(x_0)) - \hat f(x_0)] = 0\). Therefore, since \(\operatorname{E}[\epsilon^2] = \operatorname{Var}(\epsilon)\) when \(\operatorname{E}(\epsilon) = 0\), we have: \[ \begin{align} \operatorname{E}[y_0 - \hat f(x_0)]^2 &= \operatorname{E}[\epsilon^2] + \operatorname{Var}(\hat f(x_0)) + [\operatorname{Bias}(\hat f(x_0))]^2 \\ &= \operatorname{Var}(\epsilon) + \operatorname{Var}(\hat f(x_0)) + [\operatorname{Bias}(\hat f(x_0))]^2 \end{align} \]
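The decomposition can also be checked numerically (a simulation sketch under assumed choices: \(f(x) = \sin(x)\), \(\operatorname{Var}(\epsilon) = 0.25\), and \(\hat f\) a degree-3 polynomial fit; none of these come from the problem statement):

```r
# Estimate each piece of the decomposition at a fixed test point x0
# by refitting the model on many independent training sets.
set.seed(1)
f <- function(x) sin(x); sigma <- 0.5; x0 <- 1
fits <- replicate(2000, {
  x <- runif(50, -3, 3)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ poly(x, 3)), newdata = data.frame(x = x0))
})
variance <- var(fits)
bias2 <- (mean(fits) - f(x0))^2
# The expected test error at x0 is approximately bias2 + variance + sigma^2
bias2 + variance + sigma^2
```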

2 ISL Exercise 2.4.3 (10pts)

We now revisit the bias-variance decomposition.
  1. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

Answer:

# Define the model flexibility range
flexibility <- seq(1, 100, length.out = 400)

# Illustrative curves (arbitrary smooth functions chosen for demonstration):
squared_bias <- (1 / (1 + exp(0.1 * (flexibility - 50))))^2
variance <- (1 - 1 / (1 + exp(0.1 * (flexibility - 50))))^2
training_error <- 1 / (1 + exp(0.1 * (flexibility - 20)))
bayes_error <- rep(0.2, 400)
test_error <- squared_bias + variance + bayes_error  # U-shaped: bias^2 + variance + irreducible

# Plot
plot(flexibility, training_error, type = "l", col = "red", ylim = c(0, 1.3), xlab = "Flexibility", ylab = "Error", main = "Bias-Variance Tradeoff")
lines(flexibility, test_error, col = "orange")
lines(flexibility, squared_bias, col = "green")
lines(flexibility, variance, col = "blue")
lines(flexibility, bayes_error, col = "purple")

# Add legend
legend("topright", inset=.05, cex=0.8, title="Error Type", c("Training", "Test", "Squared Bias", "Variance", "Bayes"), fill=c("red", "orange", "green", "blue", "purple"), horiz=FALSE)

  2. Explain why each of the five curves has the shape displayed in part (a).

Answer: Squared Bias: Decreases with flexibility since more flexible models can adapt better to complex data patterns.

Variance: Increases with flexibility as more flexible models are more sensitive to fluctuations in the training data.

Training Error: Decreases with flexibility as the model fits the training data more closely.

Test Error: Initially decreases with increasing flexibility but then increases due to overfitting (reflected in the U-shape).

Bayes Error: Remains constant as it represents the error inherent in the data that cannot be reduced by any model.

3 ISL Exercise 2.4.4 (10pts)

You will now think of some real-life applications for statistical learning.
  1. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
  2. Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
  3. Describe three real-life applications in which cluster analysis might be useful.

Answer:

  1. Classification:
  1. Medical Diagnosis

Response: Whether a patient has a certain disease or not

Predictors: symptoms, disease history, demographic information

Goal: It can be both a prediction problem and an inference problem. We can use the model to predict whether a patient has the disease, and we can also use it to infer which predictors are significantly associated with the disease.

  2. Spam Detection

Response: Whether an email is spam or not

Predictors: email content, email address format, email time

Goal: It is a prediction problem. We want to use the model to predict whether an email is a spam or not.

  3. Credit Loan Approval

Response: Whether a person is qualified for a credit loan or not

Predictors: credit history, income, demographic information

Goal: It is a prediction problem. We want to use the model to predict whether a person is qualified for a credit loan or not.

  2. Regression:
  1. Housing Price Prediction

Response: Housing price

Predictors: house size, house location, house age

Goal: It is a prediction problem. We want to use the model to predict the housing price.

  2. Auto Insurance Price Prediction

Response: Auto insurance price

Predictors: car model, car age, car owner's age

Goal: It is a prediction problem. We want to use the model to predict the auto insurance price.

  3. Stock Price Prediction

Response: Stock price

Predictors: stock history price, stock company's financial report, stock company's news

Goal: It is a prediction problem. We want to use the model to predict the stock price.
  3. Cluster Analysis:
  1. Crime Analysis Group crime data into different clusters based on the crime type, crime location, crime time, etc.

  2. Customer Segmentation Group customers into different clusters based on their demographic information, purchase history, etc.

  3. Document Classification Group documents into different clusters based on their content, type, author, etc.
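As an illustration of how one of these clusterings might be carried out (a minimal k-means sketch on simulated customer data; the variables and group structure are hypothetical):

```r
# Minimal k-means sketch: two simulated customer segments that differ in
# annual spend and visit frequency (hypothetical data).
set.seed(42)
customers <- data.frame(
  annual_spend = c(rnorm(50, 200, 30), rnorm(50, 800, 80)),
  visits       = c(rnorm(50, 4, 1), rnorm(50, 20, 3))
)
km <- kmeans(scale(customers), centers = 2, nstart = 20)
table(km$cluster)   # two segments of roughly 50 customers each
```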

4 ISL Exercise 2.4.10 (30pts)

You can read in the Boston data set directly from the url https://raw.githubusercontent.com/biostat212a/2024winter/master/slides/data/Boston.csv. Documentation of the Boston data set is here.

This exercise involves the Boston housing data set. (a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library.

library(ISLR2)
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
?Boston
dim(Boston)
[1] 506  13

How many rows are in this data set? How many columns? What do the rows and columns represent?

Answer: There are 506 rows and 13 columns in the data set. The rows represent census tracts in Boston (the observations). The columns represent the 13 variables measured on each tract (e.g., crim, tax, medv).

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

Answer:

pairs(Boston)

pairs(~crim+nox+dis+tax+medv, data = Boston)

From the graph, we can see that some predictor pairs have strong relationships: for example, crim has a negative relationship with medv and dis, nox has a negative relationship with dis, and dis has a positive relationship with medv. However, some pairs show a non-linear relationship, such as dis and age.

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

Answer: Yes. From the graph, we can see that crim has a strong negative relationship with medv and dis: when crim increases, medv and dis decrease. Also, crim has a positive relationship with nox and tax: when crim increases, nox and tax increase.

(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

Answer:

plot(Boston$crim)

plot(Boston$tax)

plot(Boston$ptratio)

range(Boston$crim)
[1]  0.00632 88.97620
range(Boston$tax)
[1] 187 711
range(Boston$ptratio)
[1] 12.6 22.0

Yes. From the plots, we can see that some census tracts have particularly high crime rates, tax rates, and pupil-teacher ratios. The ranges of the predictors differ substantially: crim ranges from 0.00632 to 88.97620 (heavily right-skewed, with a few extreme tracts), tax ranges from 187 to 711, and ptratio ranges from 12.6 to 22.0.

(e) How many of the census tracts in this data set bound the Charles river?

Answer:

sum(Boston$chas == 1)
[1] 35

35 census tracts bound the Charles River.

(f) What is the median pupil-teacher ratio among the towns in this data set?

Answer:

median(Boston$ptratio)
[1] 19.05

The median pupil-teacher ratio is 19.05.

(g) Which census tract of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

Answer:

which(Boston$medv == min(Boston$medv))
[1] 399 406
Boston[which(Boston$medv == min(Boston$medv)), ]
       crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5
406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 22.98    5
pairs(Boston)

The census tracts of Boston that have the lowest median value of owner-occupied homes are 399 and 406.

The values of the other predictors for the ‘399’ census tract are: crim = 38.3518, zn = 0, indus = 18.1, chas = 0, nox = 0.693, rm = 5.453, age = 100, dis = 1.4896, rad = 24, tax = 666, ptratio = 20.2, lstat = 30.59, medv = 5.0.

For ‘406’, the values are: crim = 67.9208, zn = 0, indus = 18.1, chas = 0, nox = 0.693, rm = 5.683, age = 100, dis = 1.4254, rad = 24, tax = 666, ptratio = 20.2, lstat = 22.98, medv = 5.0.

For these two census tracts, compared to overall range, the values of crim, nox, age, dis, rad, tax, ptratio, and lstat are at the higher end of the range of the predictors. The value of rm is at the lower end of the range of the predictors.
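One way to make this comparison concrete (a base-R sketch, assuming Boston from ISLR2 is loaded as above) is to compute each tract value's empirical percentile within its column; values near 1 sit at the top of the predictor's range:

```r
# Percentile of tract 399's value within each column of Boston:
# e.g., crim comes out near 1 (top of the range), rm near the bottom.
library(ISLR2)
sapply(names(Boston), function(v)
  round(mean(Boston[[v]] <= Boston[399, v]), 2))
```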

(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

Answer:

sum(Boston$rm > 7)
[1] 64
sum(Boston$rm > 8)
[1] 13
Boston[Boston$rm > 8, ]
       crim zn indus chas    nox    rm  age    dis rad tax ptratio lstat medv
98  0.12083  0  2.89    0 0.4450 8.069 76.0 3.4952   2 276    18.0  4.21 38.7
164 1.51902  0 19.58    1 0.6050 8.375 93.9 2.1620   5 403    14.7  3.32 50.0
205 0.02009 95  2.68    0 0.4161 8.034 31.9 5.1180   4 224    14.7  2.88 50.0
225 0.31533  0  6.20    0 0.5040 8.266 78.3 2.8944   8 307    17.4  4.14 44.8
226 0.52693  0  6.20    0 0.5040 8.725 83.0 2.8944   8 307    17.4  4.63 50.0
227 0.38214  0  6.20    0 0.5040 8.040 86.5 3.2157   8 307    17.4  3.13 37.6
233 0.57529  0  6.20    0 0.5070 8.337 73.3 3.8384   8 307    17.4  2.47 41.7
234 0.33147  0  6.20    0 0.5070 8.247 70.4 3.6519   8 307    17.4  3.95 48.3
254 0.36894 22  5.86    0 0.4310 8.259  8.4 8.9067   7 330    19.1  3.54 42.8
258 0.61154 20  3.97    0 0.6470 8.704 86.9 1.8010   5 264    13.0  5.12 50.0
263 0.52014 20  3.97    0 0.6470 8.398 91.5 2.2885   5 264    13.0  5.91 48.8
268 0.57834 20  3.97    0 0.5750 8.297 67.0 2.4216   5 264    13.0  7.44 50.0
365 3.47428  0 18.10    1 0.7180 8.780 82.9 1.9047  24 666    20.2  5.29 21.9
print("Boston")
[1] "Boston"
summary(Boston)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          lstat      
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
 Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
      medv      
 Min.   : 5.00  
 1st Qu.:17.02  
 Median :21.20  
 Mean   :22.53  
 3rd Qu.:25.00  
 Max.   :50.00  
print("Boston[Boston$rm > 8, ]")
[1] "Boston[Boston$rm > 8, ]"
summary(Boston[Boston$rm > 8, ])
      crim               zn            indus             chas       
 Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
 1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
 Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
 Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
 3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
 Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
      nox               rm             age             dis       
 Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
 1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
 Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
 Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
 3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
 Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
      rad              tax           ptratio          lstat           medv     
 Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
 1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
 Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
 Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
 3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
 Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0  
pairs(Boston)

64 census tracts average more than seven rooms per dwelling, and 13 average more than eight. The census tracts that average more than eight rooms per dwelling tend to have a low crime rate (crim), a low percentage of lower-status population (lstat), and a high median value of owner-occupied homes (medv).

Additional Information:

library(tidyverse)

Boston <- read_csv("https://raw.githubusercontent.com/ucla-biostat-212a/2024winter/master/slides/data/Boston.csv", col_select = -1) %>% 
  print(width = Inf)
# A tibble: 506 × 13
      crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio lstat
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
 1 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1   296    15.3  4.98
 2 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2   242    17.8  9.14
 3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2   242    17.8  4.03
 4 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3   222    18.7  2.94
 5 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3   222    18.7  5.33
 6 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3   222    18.7  5.21
 7 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5   311    15.2 12.4 
 8 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5   311    15.2 19.2 
 9 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5   311    15.2 29.9 
10 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5   311    15.2 17.1 
    medv
   <dbl>
 1  24  
 2  21.6
 3  34.7
 4  33.4
 5  36.2
 6  28.7
 7  22.9
 8  27.1
 9  16.5
10  18.9
# ℹ 496 more rows
import pandas as pd
import io
import requests

url = "https://raw.githubusercontent.com/ucla-econ-425t/2023winter/master/slides/data/Boston.csv"
s = requests.get(url).content
Boston = pd.read_csv(io.StringIO(s.decode('utf-8')), index_col = 0)
Boston
        crim    zn  indus  chas    nox  ...  rad  tax  ptratio  lstat  medv
1    0.00632  18.0   2.31     0  0.538  ...    1  296     15.3   4.98  24.0
2    0.02731   0.0   7.07     0  0.469  ...    2  242     17.8   9.14  21.6
3    0.02729   0.0   7.07     0  0.469  ...    2  242     17.8   4.03  34.7
4    0.03237   0.0   2.18     0  0.458  ...    3  222     18.7   2.94  33.4
5    0.06905   0.0   2.18     0  0.458  ...    3  222     18.7   5.33  36.2
..       ...   ...    ...   ...    ...  ...  ...  ...      ...    ...   ...
502  0.06263   0.0  11.93     0  0.573  ...    1  273     21.0   9.67  22.4
503  0.04527   0.0  11.93     0  0.573  ...    1  273     21.0   9.08  20.6
504  0.06076   0.0  11.93     0  0.573  ...    1  273     21.0   5.64  23.9
505  0.10959   0.0  11.93     0  0.573  ...    1  273     21.0   6.48  22.0
506  0.04741   0.0  11.93     0  0.573  ...    1  273     21.0   7.88  11.9

[506 rows x 13 columns]

5 ISL Exercise 3.7.3 (12pts)

Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Level (1 for College and 0 for High School), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat\beta_0 = 50\), \(\hat\beta_1 = 20\), \(\hat\beta_2 = 0.07\), \(\hat\beta_3 = 35\), \(\hat\beta_4 = 0.01\), \(\hat\beta_5 = -10\).
  1. Which answer is correct, and why?

  i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.

  ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.

  iii. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.

  iv. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.

Answer: ‘iii’ is correct, and here is the reason:

Salary for high school graduates: 50 + 20 * GPA + 0.07 * IQ + 0.01 * (IQ * GPA)

Salary for college graduates: 50 + 20 * GPA + 0.07 * IQ + 35 + 0.01 * (IQ * GPA) - 10 * GPA = 85 + 10 * GPA + 0.07 * IQ + 0.01 * (IQ * GPA)

When IQ and GPA are fixed, (salary for college graduates) - (salary for high school graduates) = 35 - 10 * GPA, which is negative whenever GPA > 3.5 (for example, GPA = 4).

Therefore, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.
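This sign change can be verified directly (a one-line check; diff_salary is a helper name introduced here):

```r
# College-minus-high-school salary difference at fixed IQ and GPA:
# beta3_hat + beta5_hat * GPA = 35 - 10 * GPA, which changes sign at GPA = 3.5.
diff_salary <- function(gpa) 35 - 10 * gpa
diff_salary(c(3, 3.5, 4))   # 5 0 -5
```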

  2. Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.

Answer:

salary <- 50 + 20 * 4 + 0.07 * 110 + 35 + 0.01 * 4 * 110 - 10 * 4
salary
[1] 137.1

The salary is 137.1 K.

  3. True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

Answer: False. The magnitude of a coefficient reflects the scale of the variables and the size of the effect per unit change; it does not measure the strength of evidence for the effect. The evidence for an interaction is judged by the coefficient's p-value (equivalently, its estimate relative to its standard error). Moreover, since GPA × IQ takes large values, even a small coefficient can correspond to a substantial contribution to the response.
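A small simulation illustrates the point (simulated data, not from the exercise; the sample size and noise level are assumptions): a coefficient of 0.01 on GPA × IQ can still be statistically significant because the product term takes large values.

```r
# Simulate salaries with a true interaction coefficient of 0.01 and
# check whether the fitted interaction term is significant despite its size.
set.seed(7)
n <- 2000
gpa <- runif(n, 2, 4)
iq <- rnorm(n, 100, 15)
salary <- 50 + 20 * gpa + 0.07 * iq + 0.01 * gpa * iq + rnorm(n)
fit <- lm(salary ~ gpa * iq)
summary(fit)$coefficients["gpa:iq", ]  # small estimate, small p-value
```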

6 ISL Exercise 3.7.15 (20pts)

This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
  1. For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

Answer:

library(ISLR2)
library(ggplot2)
predictors <- colnames(Boston)[colnames(Boston) != "crim"]
models <- list()
for (pred in predictors) {
  model_formula <- as.formula(paste("crim ~", pred))
  models[[pred]] <- lm(model_formula, data = Boston)
}

for (i in predictors) {
  print(i)
  print(summary(models[[i]]))
}
[1] "zn"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-4.429 -4.222 -2.620  1.250 84.523 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
zn          -0.07393    0.01609  -4.594 5.51e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.435 on 504 degrees of freedom
Multiple R-squared:  0.04019,   Adjusted R-squared:  0.03828 
F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06

[1] "indus"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.972  -2.698  -0.736   0.712  81.813 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
indus        0.50978    0.05102   9.991  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.866 on 504 degrees of freedom
Multiple R-squared:  0.1653,    Adjusted R-squared:  0.1637 
F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16

[1] "chas"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-3.738 -3.661 -3.435  0.018 85.232 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7444     0.3961   9.453   <2e-16 ***
chas         -1.8928     1.5061  -1.257    0.209    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.597 on 504 degrees of freedom
Multiple R-squared:  0.003124,  Adjusted R-squared:  0.001146 
F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094

[1] "nox"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.371  -2.738  -0.974   0.559  81.728 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
nox           31.249      2.999  10.419  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.81 on 504 degrees of freedom
Multiple R-squared:  0.1772,    Adjusted R-squared:  0.1756 
F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16

[1] "rm"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-6.604 -3.952 -2.654  0.989 87.197 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.482      3.365   6.088 2.27e-09 ***
rm            -2.684      0.532  -5.045 6.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.401 on 504 degrees of freedom
Multiple R-squared:  0.04807,   Adjusted R-squared:  0.04618 
F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07

[1] "age"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-6.789 -4.257 -1.230  1.527 82.849 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
age          0.10779    0.01274   8.463 2.85e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.057 on 504 degrees of freedom
Multiple R-squared:  0.1244,    Adjusted R-squared:  0.1227 
F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16

[1] "dis"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-6.708 -4.134 -1.527  1.516 81.674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.4993     0.7304  13.006   <2e-16 ***
dis          -1.5509     0.1683  -9.213   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.965 on 504 degrees of freedom
Multiple R-squared:  0.1441,    Adjusted R-squared:  0.1425 
F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16

[1] "rad"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.164  -1.381  -0.141   0.660  76.433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
rad          0.61791    0.03433  17.998  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.718 on 504 degrees of freedom
Multiple R-squared:  0.3913,    Adjusted R-squared:   0.39 
F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16

[1] "tax"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.513  -2.738  -0.194   1.065  77.696 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
tax          0.029742   0.001847   16.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.997 on 504 degrees of freedom
Multiple R-squared:  0.3396,    Adjusted R-squared:  0.3383 
F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16

[1] "ptratio"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-7.654 -3.985 -1.912  1.825 83.353 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
ptratio       1.1520     0.1694   6.801 2.94e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.24 on 504 degrees of freedom
Multiple R-squared:  0.08407,   Adjusted R-squared:  0.08225 
F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11

[1] "lstat"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.925  -2.822  -0.664   1.079  82.862 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
lstat        0.54880    0.04776  11.491  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.664 on 504 degrees of freedom
Multiple R-squared:  0.2076,    Adjusted R-squared:  0.206 
F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16

[1] "medv"

Call:
lm(formula = model_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-9.071 -4.022 -2.343  1.298 80.957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.79654    0.93419   12.63   <2e-16 ***
medv        -0.36316    0.03839   -9.46   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.934 on 504 degrees of freedom
Multiple R-squared:  0.1508,    Adjusted R-squared:  0.1491 
F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

Results: From the model summaries, zn, indus, nox, rm, age, dis, rad, tax, ptratio, lstat, and medv each have a statistically significant association with crim. However, chas does not (p-value = 0.209): bounding the Charles River does not, on its own, distinguish crime rates.

Plots:

for (pred in predictors) {
  # .data[[pred]] is the tidy-evaluation replacement for the deprecated aes_string()
  p <- ggplot(Boston, aes(x = .data[[pred]], y = crim)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = paste("crim vs", pred), x = pred)
  
  print(p)
}

  2. Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_0: \beta_j = 0\)?

Answer:

model_f <- as.formula(paste("crim ~", paste(predictors, collapse = "+")))
model <- lm(model_f, data = Boston)
summary(model)

Call:
lm(formula = model_f, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-8.534 -2.248 -0.348  1.087 73.923 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
zn           0.0457100  0.0187903   2.433 0.015344 *  
indus       -0.0583501  0.0836351  -0.698 0.485709    
chas        -0.8253776  1.1833963  -0.697 0.485841    
nox         -9.9575865  5.2898242  -1.882 0.060370 .  
rm           0.6289107  0.6070924   1.036 0.300738    
age         -0.0008483  0.0179482  -0.047 0.962323    
dis         -1.0122467  0.2824676  -3.584 0.000373 ***
rad          0.6124653  0.0875358   6.997 8.59e-12 ***
tax         -0.0037756  0.0051723  -0.730 0.465757    
ptratio     -0.3040728  0.1863598  -1.632 0.103393    
lstat        0.1388006  0.0757213   1.833 0.067398 .  
medv        -0.2200564  0.0598240  -3.678 0.000261 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.46 on 493 degrees of freedom
Multiple R-squared:  0.4493,    Adjusted R-squared:  0.4359 
F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

Results: We can reject the null hypothesis for zn, dis, rad, and medv at the alpha = 0.05 level, and additionally for nox and lstat at the alpha = 0.1 level. For the remaining predictors, we cannot reject \(H_0: \beta_j = 0\).

  1. How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.

Answer:

predictors <- colnames(Boston)[colnames(Boston) != "crim"]
simple_models <- lapply(predictors, function(pred) {
  lm(as.formula(paste("crim ~", pred)), data = Boston)
})

# Extract coefficients from simple linear regression models
simple_coefs <- sapply(simple_models, function(model) coef(model)[2])

# Fit multiple linear regression model with all predictors
multiple_model <- lm(crim ~ ., data = Boston)
multiple_coefs <- coef(multiple_model)[-1] # Exclude intercept

# Create a data frame for plotting
coef_comparison <- data.frame(
  simple_df = simple_coefs,
  multiple_df = multiple_coefs[names(simple_coefs)],
  Predictor = names(simple_coefs)
)

In the simple linear regressions from (a), every predictor except chas has a statistically significant association with crim at the alpha = 0.05 level. In the multiple regression model, however, only zn, dis, rad, and medv remain significant at that level. The coefficient estimates also differ, often substantially, because the multiple regression adjusts each effect for the other predictors, which are correlated with one another.

# Plot the comparison of coefficients
ggplot(coef_comparison, aes(x = simple_df, y = multiple_df)) +
  geom_point() +
  geom_text(aes(label = Predictor), vjust = 1.5, hjust = 1.5, check_overlap = TRUE) +
  xlab("Coefficient in Simple Linear Regression") +
  ylab("Coefficient in Multiple Linear Regression") +
  ggtitle("Comparison of Regression Coefficients") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  1. Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form Y = β0 + β1X + β2X² + β3X³ + ε.

Answer:

# Initialize a list to store polynomial models
poly_models <- list()

# Fit a cubic polynomial model for each predictor
for (pred in predictors) {
  poly_formula <- as.formula(paste("crim ~ poly(", pred, ", 3, raw = TRUE)"))
  poly_models[[pred]] <- lm(poly_formula, data = Boston)
}

# Summarize and check each polynomial model
for (pred in predictors) {
  print(paste("Polynomial Model Summary for Predictor:", pred))
  print(summary(poly_models[[pred]]))
}
[1] "Polynomial Model Summary for Predictor: zn"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-4.821 -4.614 -1.294  0.473 84.130 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.846e+00  4.330e-01  11.192  < 2e-16 ***
poly(zn, 3, raw = TRUE)1 -3.322e-01  1.098e-01  -3.025  0.00261 ** 
poly(zn, 3, raw = TRUE)2  6.483e-03  3.861e-03   1.679  0.09375 .  
poly(zn, 3, raw = TRUE)3 -3.776e-05  3.139e-05  -1.203  0.22954    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.372 on 502 degrees of freedom
Multiple R-squared:  0.05824,   Adjusted R-squared:  0.05261 
F-statistic: 10.35 on 3 and 502 DF,  p-value: 1.281e-06

[1] "Polynomial Model Summary for Predictor: indus"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-8.278 -2.514  0.054  0.764 79.713 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  3.6625683  1.5739833   2.327   0.0204 *  
poly(indus, 3, raw = TRUE)1 -1.9652129  0.4819901  -4.077 5.30e-05 ***
poly(indus, 3, raw = TRUE)2  0.2519373  0.0393221   6.407 3.42e-10 ***
poly(indus, 3, raw = TRUE)3 -0.0069760  0.0009567  -7.292 1.20e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.423 on 502 degrees of freedom
Multiple R-squared:  0.2597,    Adjusted R-squared:  0.2552 
F-statistic: 58.69 on 3 and 502 DF,  p-value: < 2.2e-16

[1] "Polynomial Model Summary for Predictor: chas"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-3.738 -3.661 -3.435  0.018 85.232 

Coefficients: (2 not defined because of singularities)
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  3.7444     0.3961   9.453   <2e-16 ***
poly(chas, 3, raw = TRUE)1  -1.8928     1.5061  -1.257    0.209    
poly(chas, 3, raw = TRUE)2       NA         NA      NA       NA    
poly(chas, 3, raw = TRUE)3       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.597 on 504 degrees of freedom
Multiple R-squared:  0.003124,  Adjusted R-squared:  0.001146 
F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094

[1] "Polynomial Model Summary for Predictor: nox"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-9.110 -2.068 -0.255  0.739 78.302 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 233.09      33.64   6.928 1.31e-11 ***
poly(nox, 3, raw = TRUE)1 -1279.37     170.40  -7.508 2.76e-13 ***
poly(nox, 3, raw = TRUE)2  2248.54     279.90   8.033 6.81e-15 ***
poly(nox, 3, raw = TRUE)3 -1245.70     149.28  -8.345 6.96e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.234 on 502 degrees of freedom
Multiple R-squared:  0.297, Adjusted R-squared:  0.2928 
F-statistic: 70.69 on 3 and 502 DF,  p-value: < 2.2e-16

[1] "Polynomial Model Summary for Predictor: rm"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.485  -3.468  -2.221  -0.015  87.219 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)  
(Intercept)              112.6246    64.5172   1.746   0.0815 .
poly(rm, 3, raw = TRUE)1 -39.1501    31.3115  -1.250   0.2118  
poly(rm, 3, raw = TRUE)2   4.5509     5.0099   0.908   0.3641  
poly(rm, 3, raw = TRUE)3  -0.1745     0.2637  -0.662   0.5086  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.33 on 502 degrees of freedom
Multiple R-squared:  0.06779,   Adjusted R-squared:  0.06222 
F-statistic: 12.17 on 3 and 502 DF,  p-value: 1.067e-07

[1] "Polynomial Model Summary for Predictor: age"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-9.762 -2.673 -0.516  0.019 82.842 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)   
(Intercept)               -2.549e+00  2.769e+00  -0.920  0.35780   
poly(age, 3, raw = TRUE)1  2.737e-01  1.864e-01   1.468  0.14266   
poly(age, 3, raw = TRUE)2 -7.230e-03  3.637e-03  -1.988  0.04738 * 
poly(age, 3, raw = TRUE)3  5.745e-05  2.109e-05   2.724  0.00668 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.84 on 502 degrees of freedom
Multiple R-squared:  0.1742,    Adjusted R-squared:  0.1693 
F-statistic: 35.31 on 3 and 502 DF,  p-value: < 2.2e-16

[1] "Polynomial Model Summary for Predictor: dis"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.757  -2.588   0.031   1.267  76.378 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                30.0476     2.4459  12.285  < 2e-16 ***
poly(dis, 3, raw = TRUE)1 -15.5543     1.7360  -8.960  < 2e-16 ***
poly(dis, 3, raw = TRUE)2   2.4521     0.3464   7.078 4.94e-12 ***
poly(dis, 3, raw = TRUE)3  -0.1186     0.0204  -5.814 1.09e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.331 on 502 degrees of freedom
Multiple R-squared:  0.2778,    Adjusted R-squared:  0.2735 
F-statistic: 64.37 on 3 and 502 DF,  p-value: < 2.2e-16

[1] "Polynomial Model Summary for Predictor: rad"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.381  -0.412  -0.269   0.179  76.217 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.605545   2.050108  -0.295    0.768
poly(rad, 3, raw = TRUE)1  0.512736   1.043597   0.491    0.623
poly(rad, 3, raw = TRUE)2 -0.075177   0.148543  -0.506    0.613
poly(rad, 3, raw = TRUE)3  0.003209   0.004564   0.703    0.482

Residual standard error: 6.682 on 502 degrees of freedom
Multiple R-squared:    0.4, Adjusted R-squared:  0.3965 
F-statistic: 111.6 on 3 and 502 DF,  p-value: < 2.2e-16

[1] "Polynomial Model Summary for Predictor: tax"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.273  -1.389   0.046   0.536  76.950 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                1.918e+01  1.180e+01   1.626    0.105
poly(tax, 3, raw = TRUE)1 -1.533e-01  9.568e-02  -1.602    0.110
poly(tax, 3, raw = TRUE)2  3.608e-04  2.425e-04   1.488    0.137
poly(tax, 3, raw = TRUE)3 -2.204e-07  1.889e-07  -1.167    0.244

Residual standard error: 6.854 on 502 degrees of freedom
Multiple R-squared:  0.3689,    Adjusted R-squared:  0.3651 
F-statistic:  97.8 on 3 and 502 DF,  p-value: < 2.2e-16

[1] "Polynomial Model Summary for Predictor: ptratio"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-6.833 -4.146 -1.655  1.408 82.697 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)   
(Intercept)                   477.18405  156.79498   3.043  0.00246 **
poly(ptratio, 3, raw = TRUE)1 -82.36054   27.64394  -2.979  0.00303 **
poly(ptratio, 3, raw = TRUE)2   4.63535    1.60832   2.882  0.00412 **
poly(ptratio, 3, raw = TRUE)3  -0.08476    0.03090  -2.743  0.00630 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.122 on 502 degrees of freedom
Multiple R-squared:  0.1138,    Adjusted R-squared:  0.1085 
F-statistic: 21.48 on 3 and 502 DF,  p-value: 4.171e-13

[1] "Polynomial Model Summary for Predictor: lstat"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.234  -2.151  -0.486   0.066  83.353 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)  
(Intercept)                  1.2009656  2.0286452   0.592   0.5541  
poly(lstat, 3, raw = TRUE)1 -0.4490656  0.4648911  -0.966   0.3345  
poly(lstat, 3, raw = TRUE)2  0.0557794  0.0301156   1.852   0.0646 .
poly(lstat, 3, raw = TRUE)3 -0.0008574  0.0005652  -1.517   0.1299  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.629 on 502 degrees of freedom
Multiple R-squared:  0.2179,    Adjusted R-squared:  0.2133 
F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16

[1] "Polynomial Model Summary for Predictor: medv"

Call:
lm(formula = poly_formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.427  -1.976  -0.437   0.439  73.655 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                53.1655381  3.3563105  15.840  < 2e-16 ***
poly(medv, 3, raw = TRUE)1 -5.0948305  0.4338321 -11.744  < 2e-16 ***
poly(medv, 3, raw = TRUE)2  0.1554965  0.0171904   9.046  < 2e-16 ***
poly(medv, 3, raw = TRUE)3 -0.0014901  0.0002038  -7.312 1.05e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.569 on 502 degrees of freedom
Multiple R-squared:  0.4202,    Adjusted R-squared:  0.4167 
F-statistic: 121.3 on 3 and 502 DF,  p-value: < 2.2e-16

Yes. There is evidence of a non-linear association between crim and indus, nox, age, dis, ptratio, and medv: their quadratic and/or cubic terms are statistically significant. (For the binary variable chas, the quadratic and cubic terms are not defined, so no conclusion can be drawn.)

For the degree-1 coefficients, zn, indus, nox, dis, ptratio, and medv are statistically significant at the alpha = 0.05 level.

For the degree-2 coefficients, indus, nox, age, dis, ptratio, and medv are statistically significant at the alpha = 0.05 level.

For the degree-3 coefficients, indus, nox, age, dis, ptratio, and medv are statistically significant at the alpha = 0.05 level.
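The same conclusion can be reached with a single partial F-test per predictor, comparing the linear fit against the cubic fit. A sketch assuming `Boston` from `ISLR2`; `chas` is excluded because its higher powers are not defined for a binary variable:

```r
# For each predictor, test H0: the quadratic and cubic terms are jointly zero,
# by comparing the linear fit against the cubic fit with anova()
library(ISLR2)

preds <- setdiff(colnames(Boston), c("crim", "chas"))
nonlin_p <- sapply(preds, function(pred) {
  lin <- lm(as.formula(paste("crim ~", pred)), data = Boston)
  cub <- lm(as.formula(paste("crim ~ poly(", pred, ", 3, raw = TRUE)")),
            data = Boston)
  anova(lin, cub)[2, "Pr(>F)"]  # p-value of the joint test
})
round(sort(nonlin_p), 4)  # small p-values indicate evidence of non-linearity
```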

7 Bonus question (20pts)

For multiple linear regression, show that \(R^2\) is equal to the correlation between the response vector \(\mathbf{y} = (y_1, \ldots, y_n)^T\) and the fitted values \(\hat{\mathbf{y}} = (\hat y_1, \ldots, \hat y_n)^T\). That is \[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = [\operatorname{Cor}(\mathbf{y}, \hat{\mathbf{y}})]^2. \]

Answer: \(R^2\), in the context of multiple linear regression, is defined as: \[ \begin{equation} R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \end{equation} \] where \(\text{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) is the Residual Sum of Squares and \(\text{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2\) is the Total Sum of Squares.

The Pearson correlation coefficient between two vectors \(\mathbf{y}\) and \(\hat{\mathbf{y}}\) is defined as: \[ \begin{equation} \operatorname{Cor}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2} \sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}, \end{equation} \] where \(\bar{y}\) and \(\bar{\hat{y}}\) are the means of \(\mathbf{y}\) and \(\hat{\mathbf{y}}\), respectively.

In a least-squares fit that includes an intercept, the residuals \(e_i = y_i - \hat{y}_i\) satisfy \(\sum_i e_i = 0\) and \(\sum_i e_i \hat{y}_i = 0\). The first of these implies that the mean of the fitted values equals the mean of the observed values, \(\bar{\hat{y}} = \bar{y}\), and together they give the decomposition of the Total Sum of Squares: \[ \begin{equation} \text{TSS} = \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 + \text{RSS}. \end{equation} \] The same orthogonality handles the numerator of the correlation: writing \(y_i - \bar{y} = (\hat{y}_i - \bar{\hat{y}}) + e_i\), \[ \begin{equation} \sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}}) = \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 = \text{TSS} - \text{RSS}. \end{equation} \] Substituting this, \(\sum_{i=1}^{n}(y_i - \bar{y})^2 = \text{TSS}\), and \(\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 = \text{TSS} - \text{RSS}\) into the definition, the correlation coefficient simplifies to: \[ \begin{equation} \operatorname{Cor}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\text{TSS} - \text{RSS}}{\sqrt{\text{TSS}}\sqrt{\text{TSS} - \text{RSS}}} = \frac{\sqrt{\text{TSS} - \text{RSS}}}{\sqrt{\text{TSS}}}. \end{equation} \] Squaring this correlation coefficient, we obtain: \[ \begin{equation} [\operatorname{Cor}(\mathbf{y}, \hat{\mathbf{y}})]^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} = R^2. \end{equation} \] Therefore, \(R^2\) is indeed equal to the square of the correlation between the observed values and the fitted values in multiple linear regression.
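The identity is easy to verify numerically, for example on the multiple regression from the Boston exercise above (a sketch assuming `Boston` from `ISLR2`):

```r
# Check that R^2 reported by summary() equals the squared correlation
# between the response and the fitted values
library(ISLR2)

fit  <- lm(crim ~ ., data = Boston)
r2   <- summary(fit)$r.squared
cor2 <- cor(Boston$crim, fitted(fit))^2
all.equal(r2, cor2)  # TRUE
```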